arXiv: 1502.00831 v2 [cs.CL] 4 Feb 2015 


Open System Categorical Quantum Semantics 
in Natural Language Processing 

Robin Piedeleu*, Dimitri Kartsaklis†, Bob Coecke* and Mehrnoosh Sadrzadeh†

*University of Oxford, Department of Computer Science
Wolfson Building, Parks Road, Oxford OX1 3QD, UK
Email: {robin.piedeleu;bob.coecke}@cs.ox.ac.uk
†Queen Mary University of London, School of Electronic Engineering and Computer Science
Mile End Road, London E1 4NS, UK
Email: {d.kartsaklis;m.sadrzadeh}@qmul.ac.uk


Abstract: Originally inspired by categorical quantum mechanics
(Abramsky and Coecke, LiCS’04), the categorical compositional
distributional model of natural language meaning of
Coecke, Sadrzadeh and Clark provides a conceptually motivated
procedure to compute the meaning of a sentence, given its
grammatical structure within a Lambek pregroup and a vectorial
representation of the meaning of its parts. The predictions of
this first model have outperformed those of other models in
mainstream empirical language processing tasks on large scale
data. Moreover, just as categorical quantum mechanics (CQM)
allows for varying the model in which we interpret quantum
axioms, one can also vary the model
in which we interpret word meaning.

In this paper we show that further developments in categorical 
quantum mechanics are relevant to natural language processing 
too. Firstly, Selinger’s CPM-construction allows for explicitly 
taking into account lexical ambiguity and distinguishing between 
the two inherently different notions of homonymy and polysemy. 
In terms of the model in which we interpret word meaning, 
this means a passage from the vector space model to density 
matrices. Despite this change of model, standard empirical 
methods for comparing meanings can be easily adopted, which 
we demonstrate by a small-scale experiment on real-world data. 
This experiment moreover provides preliminary evidence of the 
validity of our proposed new model for word meaning. 

Secondly, commutative classical structures as well as their 
non-commutative counterparts that arise in the image of the 
CPM-construction allow for encoding relative pronouns, verbs 
and adjectives, and finally, iteration of the CPM-construction, 
something that has no counterpart in the quantum realm, enables 
one to accommodate both entailment and ambiguity. 

I. Introduction 

Language serves to convey meaning. From this perspective, 
the ultimate and long-standing goal of any computational 
linguist is to capture and adequately represent the meaning 
of an utterance in a computer’s memory. At word level, 
distributional semantics offers an effective way to achieve 
that goal; following the distributional hypothesis [1], which
states that the meaning of a word is determined by its context, 
words are represented as vectors of co-occurrence statistics 
with all other words in the vocabulary. While models following 
this paradigm have been found very useful in a number of 
natural language processing tasks El-El, they do not scale 
up to the level of phrases or sentences. This is due to the
capacity of natural language to generate infinite structures
(phrases and sentences) from finite means (words); no text
corpus, regardless of its size, can provide reliable distributional
statistics for a multi-word sentence. On the other hand,
type-logical approaches conforming to the tradition of Lambek 0,
Montague 0 and other pioneers of language are compositional
and deal with the sentence at a more abstract level, based
on the syntactic rules that hold between the different text
constituents, but in principle they do not provide a convincing
model for word meaning.

The categorical compositional distributional model of Coecke,
Sadrzadeh and Clark 0 addresses the challenge of
combining these two orthogonal models of meaning in a 
unified setting. The model is based on the observation that a 
grammar expressed as a pregroup 0 shares the same structure 
with the category of finite dimensional vector spaces and linear 
maps, that of a compact closed category 0. In principle, this 
offers a canonical way to express a grammatical derivation 
as a morphism that defines linear-algebraic manipulations 
between vector spaces, resulting in a sentence vector. The main 
characteristic of the model is that the grammatical type of a 
word determines the vector space in which it lives. Words with 
atomic types, such as nouns, are represented by vectors living 
in some basic vector space N; on the contrary, relational words 
such as verbs and adjectives live in tensor product spaces 
of higher order. An adjective, for example, is an element of
$N \otimes N$, while a transitive verb lives in $N \otimes S \otimes N$. The
relational tensors act on their argument by tensor contraction,
a generalization of the familiar notion of matrix multiplication 
to higher order tensors. 

Ambiguity is a dominant feature of language. At the lexical
level, one can distinguish between two broad types of
ambiguity: homonymy refers to cases in which, due to some
historical accident, words that share exactly the same spelling
and pronunciation are used to describe completely distinct
concepts; such an example is ‘bank’, meaning a financial
institution and the land alongside a river. On the other hand, the
senses of a polysemous word are usually closely related with 
only small deviations between them; as an example, think of 
‘bank’ again as a financial institution and the concrete building 
where that institution is accommodated. These two notions
of ambiguity are inherently different; while a polysemous
word still retains a certain level of semantic coherence, a
homonymous word can be seen as an incoherent mixing
due to coincidence. The issue of lexical ambiguity and its
different levels is currently ignored by almost all
attempts that aim to equip distributional models of meaning
with compositionality.

The purpose of this paper is to provide the theoretical
foundations for a compositional distributional model of meaning
capable of explicitly dealing with lexical ambiguity. At
a philosophical level, we define an ambiguous word as a
probabilistic mixing of idealistically pure (in the sense of
completely unambiguous) concepts. In practice, though, these
pure concepts cannot be precisely defined or even expressed
by words; no word is completely unambiguous, and its precise
meaning can only be defined in relation to a relevant context
Qa. Empirically, then, we can approximate these pure concepts
by meaning vectors provided by a word sense induction
method based on clustering the contexts in which a word
occurs. We take the set of meaning vectors assigned to a
specific word as representing distinct polysemous uses of the
word, i.e. relatively self-contained concepts with a certain level
of semantic coherence. An ambiguous word then corresponds
to a homonymous case, where the same name is used for more
than one semantically coherent concept.

In the proposed model we exploit the observation that the 
compact closed structure on which the original model of 
Coecke et al. 0 was based provides an abstraction of the 
Hilbert space formulation used in the quantum theory, in terms 
of pure quantum states as vectors, which is known under the 
umbrella of categorical quantum mechanics CD. In fact, the 
original model of Coecke et al. was itself greatly inspired by 
quantum theory, and in particular, by quantum protocols such 
as quantum teleportation, as explained in CD. Importantly, 
vectors in a Hilbert space represent the states of a closed 
quantum system, also called pure states. Selinger’s CPM- 
construction El, which maps any dagger compact closed 
category to another one, then adjoins open system states,
also called mixed states. In the new model, these allow for a 
lack of knowledge on part of the system under consideration, 
which may be about an extended part of the quantum system, 
or uncertainty (read: ambiguity) regarding the preparation 
procedure. 

The crucial distinction between homonymous and polysemous
words is achieved as follows: while a polysemous word
corresponds to a pure quantum state, a homonymous word is
given by a mixed state that essentially embodies a probability
distribution over all potential meanings of that word.
Mathematically, a mixed state is expressed as a density matrix: a
self-adjoint, positive semi-definite operator with trace one. The
new formulation offers many opportunities for interesting and
novel research. For instance, by exploiting the notion of von
Neumann entropy one can measure how ambiguity evolves
from individual words to larger text constituents; we would
expect that the level of ambiguity in the word ‘bank’ is higher
than that of the compound ‘river bank’.
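The entropy comparison suggested above can be sketched numerically. The following is a minimal illustration and not the authors' implementation: the two orthonormal sense vectors, the mixing weights 0.7 and 0.3, and all function names are assumptions chosen for the example. No composition is modelled here; the compound 'river bank' is simply assumed to have resolved to a single sense, i.e. a pure state.

```python
import math

def outer(v):
    """Return the projector |v><v| of a real vector v as a nested list."""
    return [[a * b for b in v] for a in v]

def mix(weighted_states):
    """Density matrix sum_i p_i |v_i><v_i| from (probability, unit vector) pairs."""
    dim = len(weighted_states[0][1])
    rho = [[0.0] * dim for _ in range(dim)]
    for p, v in weighted_states:
        proj = outer(v)
        for r in range(dim):
            for c in range(dim):
                rho[r][c] += p * proj[r][c]
    return rho

def eigenvalues_2x2(rho):
    """Eigenvalues of a real symmetric 2x2 matrix, in closed form."""
    a, b, d = rho[0][0], rho[0][1], rho[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return [mean - radius, mean + radius]

def von_neumann_entropy(rho):
    """S(rho) = -sum_k l_k log2(l_k) over the nonzero eigenvalues l_k."""
    return -sum(l * math.log2(l) for l in eigenvalues_2x2(rho) if l > 1e-12)

finance = [1.0, 0.0]  # toy sense vector (assumption)
river = [0.0, 1.0]    # toy sense vector (assumption)

bank = mix([(0.7, finance), (0.3, river)])  # homonymous word: mixed state
river_bank = mix([(1.0, river)])            # assumed disambiguated: pure state

entropy_bank = von_neumann_entropy(bank)
entropy_river_bank = von_neumann_entropy(river_bank)
```

Under these toy assumptions the mixed state has strictly positive entropy while the pure state has entropy zero, which is the ordering the text expects between 'bank' and 'river bank'.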


Furthermore, the richness of the new category in which
the meanings of words now live offers interesting alternative
design options. In the past, for example, Sadrzadeh, Kartsaklis
and colleagues urn on enriched the categorical compositional
model with elements of classical processing, exploiting
the fact that any basis of a finite-dimensional vector space
induces a commutative Frobenius algebra over this space,
which allows the uniform copying or deleting of the information
relative to this basis El. As we will see in Sect. V,
the dagger compact closed categories arising from the
CPM-construction also accommodate canonical non-commutative
Frobenius algebras which have the potential to account for
the non-commutativity of language.

Finally, we discuss how iterated application of the CPM- 
construction, which gives rise to states that have no interpretation
in quantum theory, does have a natural application
in natural language processing. It allows for simultaneous 
semantic representation of more than one language feature that 
can be represented by density matrices, for example, lexical 
entailment in conjunction with ambiguity. 

Outline. Sect. II provides an introduction to categorical
compositional distributional semantics; Sect. III explains the
linguistic intuition behind the core ideas of this paper; Sect. IV
gives the mathematical details for the extension of the original
model to the quantum formulation; Sect. V discusses
non-commutativity in this new context; finally, Sect. VI provides
the basic intuition for yet another extension of the model that
adds the notion of entailment to that of ambiguity.

Related work. The issue of lexical ambiguity in categorical
compositional models of meaning has been previously
experimentally investigated by Kartsaklis and Sadrzadeh El,
who present evidence that the introduction of an explicit 
disambiguation step on the word vectors prior to composition 
improves the performance of the model in various sentence 
and phrase similarity tasks. 

Furthermore, the research presented here is not the only one
that uses density matrices for linguistic purposes. Balkir [18]
uses a form of density matrices in order to provide a similarity
measure that can be used for evaluating hyponymy-hypernymy
relations. In Sect. VI we indicate how these two uses of density
matrices can be merged into one. Blacoe et al. [19] describe a
distributional (but not compositional) model of meaning based
on density matrices created by grammatical dependencies. At a
more generic level not directly associated with density matrices,
the application of ideas from quantum theory to language
has proved to be a very popular field of research; see for example
the work of Bruza et al. [20] and Widdows [21].

Finally, the core idea of this paper to represent ambiguous
words as mixed states is based on material presented in the
MSc thesis of the first author [22] and the PhD thesis of the
second author [23].

II. Background 

The field of category theory aims at identifying and studying 
connections between seemingly different forms of mathematical
structures. A very representative example of its potency
is the compositional categorical framework of Coecke et al. 
0 , which shows that a grammatical derivation defining the 
structure of a sentence is homomorphic to a linear-algebraic 
formula acting on a semantic space defined by a distributional 
model. The framework offers a concrete manifestation of the 
rule-to-rule hypothesis [24], and a mathematical counterpart
to the formal semantics perspective on language. As noted 
above, the main idea is based on the fact that both the type- 
logic of the model, a pregroup grammar, and the semantic 
category, namely FHilb, possess a compact-closed structure. 
Recall that a compact closed category is a monoidal category 
in which every object $A$ has a left and right adjoint, denoted as
$A^l$, $A^r$ respectively, for which the following special morphisms
exist:

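$$\eta^l : I \to A \otimes A^l \qquad \eta^r : I \to A^r \otimes A \tag{1}$$

$$\epsilon^l : A^l \otimes A \to I \qquad \epsilon^r : A \otimes A^r \to I \tag{2}$$

These maps need to satisfy certain conditions (known as
yanking equations) which ensure that all relevant diagrams
commute:

$$(1_A \otimes \epsilon^l_A) \circ (\eta^l_A \otimes 1_A) = 1_A \qquad (\epsilon^r_A \otimes 1_A) \circ (1_A \otimes \eta^r_A) = 1_A \tag{3}$$
$$(\epsilon^l_A \otimes 1_{A^l}) \circ (1_{A^l} \otimes \eta^l_A) = 1_{A^l} \qquad (1_{A^r} \otimes \epsilon^r_A) \circ (\eta^r_A \otimes 1_{A^r}) = 1_{A^r}$$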

Finally, the passage from syntax to semantics is carried out 
by a strong monoidal functor and, as a result, preserves the 
compact closed structure. Before we proceed to expand on 
the above constructions, we briefly introduce the graphical 
calculus of monoidal categories which will be used throughout 
our exposition. 


A. Graphical calculus

Monoidal categories are complete with regard to a graphical
calculus [25] which depicts derivations in their internal
language very intuitively, thus simplifying the reading and
the analysis. Objects are represented as labelled wires, and
morphisms as boxes with input and output wires. The $\eta$- and
$\epsilon$-maps are given as half-turns.

[Diagram: the maps $\eta^l$, $\eta^r$, $\epsilon^l$ and $\epsilon^r$ drawn as half-turns.]

Composing morphisms amounts to connecting outputs to
inputs, while the tensor product is simply juxtaposition:

[Diagram: composition and tensor product of morphisms.]

In this language, the yanking equations (3) get an intuitive
visual justification (here for the first two identities):

[Diagram: the first two yanking equations.]

For a given object $A$, we define a state of $A$ to be a
morphism $I \to A$. If $A$ denotes a vector space, we can think
of a state as a specific vector living in that space. In our
graphical language the unit object $I$ can be omitted, leading
to the following representation of states:

[Diagram: states drawn without the unit object.]

Note that the second diagram from the left depicts an
entangled state of $A \otimes B$; product states (such as the rightmost
one) are simple juxtapositions of two states.

B. Pregroup grammars

A pregroup algebra 0 is a partially ordered monoid with
unit 1, each element $p$ of which has a left adjoint $p^l$ and a right
adjoint $p^r$, conforming to the following inequalities:

$$p^l \cdot p \le 1 \le p \cdot p^l \quad \text{and} \quad p \cdot p^r \le 1 \le p^r \cdot p \tag{4}$$

A pregroup grammar is a pregroup algebra freely generated
over a set of basic types $B$ including a designated end type and
a type dictionary that assigns elements of the pregroup to the
vocabulary of a language. For example, it is usually assumed
that $B = \{n, s\}$, where $n$ is the type assigned to a noun or
a well-formed noun phrase, while $s$ is a designated type kept
for a well-formed sentence. Atomic types can be combined in
order to provide types for relational words; for example, an
adjective has type $n \cdot n^l$, reflecting the fact that it is something
that expects a noun at its right-hand side in order to return
another noun. Similarly, a transitive verb has type $n^r \cdot s \cdot n^l$,
denoting something that expects two nouns (one at each side)
in order to return a sentence. Based on (4), for this latter case
the pregroup derivation gets the following form:

$$n \cdot (n^r \cdot s \cdot n^l) \cdot n = (n \cdot n^r) \cdot s \cdot (n^l \cdot n) \le 1 \cdot s \cdot 1 \le s \tag{5}$$

Let $\mathcal{C}_F$ denote the free compact closed category derived
from the pregroup algebra of a pregroup grammar [26]; then,
according to 0 and 0, the above type reduction corresponds
to the following morphism in $\mathcal{C}_F$:

$$\epsilon^r_n \cdot 1_s \cdot \epsilon^l_n : n \cdot n^r \cdot s \cdot n^l \cdot n \to s \tag{6}$$
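The type reduction in (5) can be mimicked mechanically. The sketch below is illustrative only: the encoding of a simple type as a pair (base, z), with z = -1 for a left adjoint and z = +1 for a right adjoint, and the function name `reduce_types` are assumptions made for this example. The greedy left-to-right strategy handles the simple sentences discussed here but is not a complete pregroup parser.

```python
def reduce_types(types):
    """Greedily cancel adjacent pairs (a, z), (a, z + 1).

    p^l . p <= 1 is encoded as (p, -1)(p, 0),
    p . p^r <= 1 is encoded as (p, 0)(p, 1).
    """
    stack = []
    for base, z in types:
        if stack and stack[-1] == (base, z - 1):
            stack.pop()  # contraction: the adjacent pair reduces to the unit
        else:
            stack.append((base, z))
    return stack

N, S = "n", "s"
noun = [(N, 0)]
adjective = [(N, 0), (N, -1)]                # n . n^l
transitive_verb = [(N, 1), (S, 0), (N, -1)]  # n^r . s . n^l

# noun verb noun, as in (5)
result = reduce_types(noun + transitive_verb + noun)

# adjective noun verb noun
result_with_adjective = reduce_types(adjective + noun + transitive_verb + noun)

# verb noun: the subject is missing, so the string does not reduce to s
result_incomplete = reduce_types(transitive_verb + noun)
```

A word string counts as a well-formed sentence in this sketch exactly when the reduction leaves the single type `("s", 0)`.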

C. From syntax to semantics 

The type-logical approach presented in Sect. II-B is compositional,
but unable to distinguish between words of the
same type; even more importantly, the only information that a
derivation such as the one in (5) can provide to us is whether
the sentence is well-formed or not. Distributional models of
meaning offer a solution to the first of these problems, by
representing a word in terms of its distributional behaviour in
a large corpus of text. While the actual methods for achieving
this can vary,¹ the goal is always the same: to represent words
as points of some metric space, where differences in semantic
similarity can be detected and precisely quantified. The prime
intuition is that words appearing in similar contexts must have
a similar meaning [1]. The word vectors typically live in a
high-dimensional semantic space with a fixed orthonormal
basis, the elements of which correspond to content-bearing
words. The values in the vector of a target word $w_t$ express
co-occurrence statistics extracted from some large corpus of
text, showing how strongly $w_t$ is associated with each one of
the basis words. For a concise introduction to distributional
models of meaning see [27].

¹ See Appendix A for a concrete implementation.

We take (FHilb, $\otimes$), the category of finite dimensional
Hilbert spaces and linear maps over the scalar field I, to be
the semantic counterpart of $\mathcal{C}_F$ which, as we saw before,
accommodates the grammar. FHilb is a dagger compact closed
category (or †-compact closed); that is, a symmetric compact
closed category (so that $A^r = A^l = A^*$ for all $A$) equipped
with an involutive contravariant functor $\dagger$ : FHilb $\to$ FHilb
that is the identity on objects. Concretely, in FHilb, for a
morphism $f : A \to B$, its dagger $f^\dagger : B \to A$ is simply its
adjoint. Furthermore, $\epsilon_A = \eta_A^\dagger \circ \sigma_{A^*, A}$ for all $A$.

Taking $|\psi\rangle$ and $|\phi\rangle$ to be two vectors in a Hilbert space $\mathcal{H}$,
$\epsilon_A : A^* \otimes A \to I$ is the pairing $\epsilon_A(\langle\psi|, |\phi\rangle) = \langle\psi|(|\phi\rangle) = \langle\psi|\phi\rangle$
and $\eta_A = \epsilon_A^\dagger$. This yields a categorical definition of the
inner product:

[Diagram: the inner product $\langle\psi|\phi\rangle$ drawn as the states $\psi$ and $\phi$ joined by an $\epsilon$-map.] (7)

In practice it is often necessary to normalise in order to 
obtain the cosine of the angle between vectors as a measure 
of semantic similarity. This measure has been widely (and 
successfully, see 0) used in distributional models. 
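As a rough illustration of the distributional vectors and the cosine measure just described, the following sketch builds co-occurrence vectors from a four-sentence toy corpus. The corpus, the basis words, the sentence-level co-occurrence window and the function names are all assumptions made for this example; they do not reproduce the implementation referred to in the Appendix.

```python
import math
from collections import Counter

# Toy corpus and basis (assumptions for illustration only).
corpus = [
    "the bank approved the loan",
    "the bank raised the interest rate",
    "the lender approved the loan",
    "the river flooded the valley",
]
basis = ["loan", "interest", "approved", "flooded", "valley"]

def cooccurrence_vector(target, sentences, basis_words):
    """Count how often each basis word shares a sentence with the target."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return [float(counts[b]) for b in basis_words]

def cosine(u, v):
    """Normalised inner product of two real vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

bank = cooccurrence_vector("bank", corpus, basis)
lender = cooccurrence_vector("lender", corpus, basis)
river = cooccurrence_vector("river", corpus, basis)

similarity_bank_lender = cosine(bank, lender)
similarity_bank_river = cosine(bank, river)
```

In this toy corpus 'bank' only ever occurs in its financial sense, so it comes out close to 'lender' and orthogonal to 'river'; real corpora mix the senses, which is the problem the rest of the paper addresses.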

D. Quantizing the grammar 

We now proceed to present a solution to the second problem
posed above, that of providing a quantified semantic representation
for a sentence by composing the representations of the
words therein: in this paper we follow [28] and [15] and we
achieve the transition from syntax to semantics via a strong
monoidal functor $Q$:

$$Q : \mathcal{C}_F \to \mathrm{FHilb} \tag{8}$$

which can be shown to also preserve the compact structure so
that $Q(p^l) = Q(p)^l$ and $Q(p^r) = Q(p)^r$ for $p$ an object of
$\mathcal{C}_F$. Since each object in FHilb is its own dual we also have
$Q(p^l) = Q(p) = Q(p^r)$. Moreover, for basic types, we let:

$$Q(n) = N \qquad Q(s) = S \tag{9}$$

Furthermore, since $Q$ is strongly monoidal, complex types
are mapped to tensor products of vector spaces:

$$Q(n \cdot n^r) = Q(n) \otimes Q(n^r) = N \otimes N \tag{10}$$
$$Q(n^r \cdot s \cdot n^l) = Q(n^r) \otimes Q(s) \otimes Q(n^l) = N \otimes S \otimes N$$

Finally, each morphism in $\mathcal{C}_F$ is mapped to a linear map
in FHilb. Equipped with such a functor, we can now define
the meaning of a sentence as follows:


Definition II.1. Let $|w_i\rangle$ be a vector $I \to Q(p_i)$ corresponding
to word $w_i$ with type $p_i$ in a sentence $w_1 w_2 \ldots w_n$. Given
a type-reduction $\alpha : p_1 \cdot p_2 \cdot \ldots \cdot p_n \to s$, the meaning of the
sentence is defined as:

$$|w_1 w_2 \ldots w_n\rangle := Q(\alpha)(|w_1\rangle \otimes \ldots \otimes |w_n\rangle) \tag{11}$$

Take as an example the sentence “Trembling shadows play
hide-and-seek”, with the standard types $n \cdot n^l$ and $n^r \cdot s \cdot n^l$
assigned to adjectives and verbs, respectively. Then the adjective
‘trembling’ will be a morphism $I \to Q(n \cdot n^l) = I \to N \otimes N$,
that is, a state in the tensor product space $N \otimes N$. Note that
this matrix defines a linear map $N \to N$, an interpretation
that is fully aligned with the formal semantics perspective: an
adjective is a function that takes a noun as input and returns
a modified version of it. Similarly, the verb ‘play’ lives in
$N \otimes S \otimes N$ or, equivalently, is a bi-linear map $N \otimes N \to S$ (with
a subject and an object as arguments) which returns a sentence.
In contrast to those two relational words, the nouns ‘shadows’
and ‘hide-and-seek’ are plain vectors in $N$. The syntax of the
sentence conforms to the following type reduction:

$$(\epsilon^r_n \cdot 1_s) \circ (1_n \cdot \epsilon^l_n \cdot 1_{n^r} \cdot 1_s \cdot \epsilon^l_n) : n \cdot n^l \cdot n \cdot n^r \cdot s \cdot n^l \cdot n \to s \tag{12}$$

which, when transferred to FHilb via $Q$, yields the following
diagrammatic derivation:

[Diagram: the derivation for “Trembling shadows play hide-and-seek”.] (13)

Recall that $\epsilon$-morphisms (depicted as caps) in FHilb denote
inner product, so when a relational word of order $m$ is applied
on an argument of order $n$, the result is always a tensor of order
$n + m - 2$; this simply means that the computation in (13)
will be a tensor in $S$, serving as the semantic representation
for the sentence.
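The computation described by (12) can be written out as plain tensor contractions. This is a minimal sketch under toy assumptions: $N$ and $S$ are both taken to be two-dimensional, every number below is made up, and the helper names are hypothetical. The adjective is an order-2 tensor, the verb an order-3 tensor indexed as `verb[subject][sentence][object]`.

```python
def apply_adjective(adjective, noun):
    """Contract the second index of an order-2 tensor with a noun vector."""
    return [sum(row[b] * noun[b] for b in range(len(noun))) for row in adjective]

def apply_transitive_verb(verb, subject, obj):
    """Contract verb[i][s][j] with subject[i] and obj[j], leaving index s."""
    sentence_dim = len(verb[0])
    return [
        sum(subject[i] * verb[i][s][j] * obj[j]
            for i in range(len(subject))
            for j in range(len(obj)))
        for s in range(sentence_dim)
    ]

# Toy vectors and tensors (assumptions for illustration only).
shadows = [1.0, 2.0]
hide_and_seek = [0.0, 1.0]
trembling = [[1.0, 0.0],
             [0.0, 0.5]]
play = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 0.0]],
]

# Order 2 applied to order 1 gives order 2 + 1 - 2 = 1: a modified noun.
trembling_shadows = apply_adjective(trembling, shadows)

# Order 3 contracted with two order-1 arguments gives an order-1 tensor in S.
sentence = apply_transitive_verb(play, trembling_shadows, hide_and_seek)
```

Each contraction removes two indices, one from the relational word and one from its argument, which is the $n + m - 2$ rule stated above.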


E. Using Frobenius Algebras in Language 

If distributional models provide a way to build meaning
vectors for words with atomic types, the question of how to
create a tensor representing a relational or a functional word
is much more challenging. Following an approach that resembles
an extensional perspective of semantics, Grefenstette and
Sadrzadeh [29] propose the representation of a relational word
as the sum of its argument vectors. In other words, an adjective
is given as $\sum_i |noun_i\rangle$, where $i$ iterates through all nouns that
the specific adjective modifies in a large corpus and $|noun_i\rangle$
is the meaning vector of the $i$th noun; similarly, an intransitive
verb is the sum of all its subject nouns, while a transitive verb
(a function of two arguments) is defined as follows:

$$|verb\rangle = \sum_i |subj_i\rangle \otimes |obj_i\rangle \tag{14}$$
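Equation (14) amounts to summing outer products over the subject/object pairs with which a verb is observed. A minimal sketch, assuming two-dimensional toy noun vectors and a hypothetical helper `verb_matrix`:

```python
def outer(u, v):
    """Tensor (outer) product of two vectors as a matrix."""
    return [[a * b for b in v] for a in u]

def verb_matrix(subject_object_pairs):
    """Sum of subj_i (x) obj_i over all observed pairs, as in (14)."""
    dim_s = len(subject_object_pairs[0][0])
    dim_o = len(subject_object_pairs[0][1])
    total = [[0.0] * dim_o for _ in range(dim_s)]
    for subj, obj in subject_object_pairs:
        term = outer(subj, obj)
        for r in range(dim_s):
            for c in range(dim_o):
                total[r][c] += term[r][c]
    return total

# Toy (subject, object) vector pairs for one verb (assumed values).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
verb = verb_matrix(pairs)
```

The result is an order-2 tensor (a matrix), one order short of the order-3 tensor that the type of a transitive verb calls for.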

While this idea is intuitive and fairly easy to implement, it
immediately results in a mismatch between the grammatical
types of the words and their concrete representations. The type
of a transitive verb, for example, is $n^r \cdot s \cdot n^l$, but the method
yields a concrete representation which lives in $N \otimes N$. In order
to provide a solution to this problem, Kartsaklis, Sadrzadeh,
Pulman and Coecke fOTl propose to expand this tensor to
$N \otimes N \otimes N$ using the co-monoid part of a Frobenius algebra. Later,
Sadrzadeh et al. [14], [30] show how both the monoid and the
co-monoid maps of a Frobenius algebra can be used to model
meanings of functional words such as subjective, objective,
and possessive relative pronouns.

Recall from ED that a Frobenius algebra in a monoidal
category is a quintuple $(A, \Delta, \iota, \mu, \zeta)$ such that:

• $(A, \mu, \zeta)$ is a monoid, that is we have:

$$\mu : A \otimes A \to A \qquad \zeta : I \to A \tag{15}$$

satisfying associativity and unit conditions;

• $(A, \Delta, \iota)$ is a co-monoid, so that:

$$\Delta : A \to A \otimes A \qquad \iota : A \to I \tag{16}$$

satisfy co-associativity and co-unit conditions;

• furthermore, $\Delta$ and $\mu$ adhere to the following Frobenius
condition:

$$(\mu \otimes 1_A) \circ (1_A \otimes \Delta) = \Delta \circ \mu = (1_A \otimes \mu) \circ (\Delta \otimes 1_A) \tag{17}$$

In a monoidal †-category, a †-Frobenius algebra is a Frobenius
algebra whose co-monoid is adjoint to the monoid.
As shown in d. every finite dimensional Hilbert space
$\mathcal{H}$ with orthonormal basis $\{|i\rangle\}$ has a †-Frobenius algebra
associated to it, the co-multiplication and multiplication of
which correspond to uniformly copying and uncopying the
basis as follows:

$$\Delta :: |i\rangle \mapsto |i\rangle \otimes |i\rangle \qquad \iota :: |i\rangle \mapsto 1 \tag{18}$$
$$\mu :: |i\rangle \otimes |j\rangle \mapsto \delta_{ij} |i\rangle := \begin{cases} |i\rangle & i = j \\ 0 & i \neq j \end{cases} \qquad \zeta :: 1 \mapsto \sum_i |i\rangle$$

Abstractly, this enables us to copy and delete the (classical)
information relative to the given basis. Concretely, the copying
$\Delta$-map amounts to encoding faithfully the components of a
vector in $\mathcal{H}$ as the diagonal elements of a matrix in $\mathcal{H} \otimes \mathcal{H}$,
while the “uncopying” operation $\mu$ picks out the diagonal
elements of a matrix and returns them as a vector in $\mathcal{H}$.
Kartsaklis et al. m use the Frobenius co-multiplication in
order to faithfully encode tensors of lower order constructed by
the argument summing procedure of [29] to higher order ones,
thus restoring the proper functorial relation. The concrete
representation of an adjective is given as $\Delta(\sum_i |noun_i\rangle)$ which,
when substituted into Definition II.1, gives the composition on
the right:²

[Diagram: composition of a Frobenius-encoded adjective with a noun, and its normal form.] (19)

² Notice that the resulting normal form is just a direct application of the
Frobenius condition (17).

In order to apply this method to verbs, we first need to
notice that since now our sentence space will be essentially
produced by copying basis elements of the noun space, our
functor $Q$ can no longer apply different mappings to the
two atomic pregroup types $\{s, n\}$; both of these should be
mapped onto the same basic vector space, so we get:

$$Q(n) = W \qquad Q(s) = W \tag{20}$$
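The copying and uncopying maps in (18) are easy to write out for real vectors in a fixed basis. The sketch below is illustrative only: the function names and toy numbers are assumptions. It also shows that an adjective built as $\Delta$ of a sum of noun vectors acts on a noun by element-wise multiplication.

```python
def copy_basis(v):
    """Delta: place the components of v on the diagonal of a matrix."""
    n = len(v)
    return [[v[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def uncopy_basis(matrix):
    """mu: read the diagonal of a matrix back as a vector."""
    return [matrix[i][i] for i in range(len(matrix))]

def delete(v):
    """iota: send every basis vector to 1, extended linearly."""
    return sum(v)

def unit(n):
    """zeta: the sum of all basis vectors."""
    return [1.0] * n

def outer(u, v):
    return [[a * b for b in v] for a in u]

def apply_matrix(matrix, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

# Toy noun vectors modified by one adjective in a corpus (assumed values).
nouns = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
noun_sum = [sum(column) for column in zip(*nouns)]

adjective = copy_basis(noun_sum)        # Delta applied to the summed nouns
car = [2.0, 1.0, 0.5]                   # another toy noun vector
modified = apply_matrix(adjective, car) # composition with the noun
pointwise = [a * b for a, b in zip(noun_sum, car)]
```

Because the encoded adjective is diagonal, `modified` coincides with `pointwise`, which is one way to read the normal form mentioned in the footnote.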


Given the above limitation, the case of an intransitive verb
is quite similar to that of an adjective: we construct a concrete
tensor as $\Delta(\sum_i |subj_i\rangle)$, the composition of which with a
subject noun on its right-hand side proceeds as in (19) due
to the commutativity of the algebra. The case of a transitive
verb is more interesting, since now the Frobenius structure
offers two options: starting from a verb matrix in $W \otimes W$
created as in (14), we can encode it to a tensor in $W \otimes W \otimes W$
by either copying the row dimension (responsible for the
interaction of the verb with the subject noun) or the column
dimension (responsible for the interaction with the object). For
the latter case, referred to as Copy-Object, the composition
becomes as follows:

[Diagram: the Copy-Object composition of a transitive verb with its subject and object.] (21)
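A hedged numerical reading of the Copy-Object option: the verb matrix is expanded to an order-3 tensor by copying its column (object) index into the sentence index, and the result is contracted with the subject and object. The index layout `tensor[subject][sentence][object]`, the toy values and the function names are assumptions of this sketch; the closed form at the end follows from expanding the contraction by hand.

```python
def copy_object_tensor(verb):
    """Expand verb[i][o] to T[i][s][o] = verb[i][o] if s == o else 0."""
    n = len(verb)
    return [[[verb[i][o] if s == o else 0.0 for o in range(n)]
             for s in range(n)]
            for i in range(n)]

def compose(tensor, subject, obj):
    """Contract T[i][s][o] with subject[i] and obj[o], leaving index s."""
    n = len(subject)
    return [sum(subject[i] * tensor[i][s][o] * obj[o]
                for i in range(n) for o in range(n))
            for s in range(n)]

def copy_object_closed_form(verb, subject, obj):
    """Element-wise product of obj with (verb transposed) applied to subject."""
    n = len(subject)
    return [obj[s] * sum(subject[i] * verb[i][s] for i in range(n))
            for s in range(n)]

# Toy verb matrix and noun vectors (assumed values).
verb = [[1.0, 1.0],
        [1.0, 0.0]]
subject = [1.0, 2.0]
obj = [3.0, 1.0]

sentence_full = compose(copy_object_tensor(verb), subject, obj)
sentence_closed = copy_object_closed_form(verb, subject, obj)
```

Both routes give the same sentence vector, so in practice the order-3 tensor never needs to be materialised.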



The composition for the case of copying the subject dimension 
proceeds similarly on the left-hand side. In practice, empirical 
work has shown that objects have stronger influence on the 
meaning of a transitive sentence than subjects [15], especially
when the head verb is ambiguous. This in principle means 
that the Frobenius structure of the Copy-Object approach is a 
more effective model of sentential compositionality. 

Furthermore, Sadrzadeh et al. M exploit the abilities 
of Frobenius algebras in order to model relative pronouns. 
Specifically, copying is used in conjunction with deleting in 
order to allow the head noun of a relative clause to interact 
with its modifier verb phrase from the far left-hand side of the 
clause to its right-hand side. For the case of a relative clause 
modifying a subject this is achieved as follows: 


[Diagram: the derivation for the relative clause “the man who likes Mary” and its normal form.] (22)


This concludes the presentation of the categorical compositional
model of Coecke et al. G) and the related research
up to today. From the next section we start working towards
an extension of this model capable of handling the notions of
homonymy and polysemy in a unified manner.










III. Understanding Ambiguity 

In order to deal with lexical ambiguity we first need to
understand its nature. In other words, we are interested in studying
in what way an ambiguous word differs from an unambiguous
one, and what the defining quality is that makes this distinction
clear. On the surface, the answer to these questions seems
straightforward: an ambiguous word is one with more than one
lexicographic entry in the dictionary. However, this definition
fits only homonymous cases well, in which due to some
historical accident words that share the same spelling and
pronunciation refer to completely unrelated concepts. Indeed,
while the number of meanings of a homonymous word such as
‘bank’ is almost fixed across different dictionaries, the same
is not true for the small (and overlapping) variations of senses
that might be listed under a word expressing a polysemous
case.

The crucial distinction between homonymy and polysemy 
is that in the latter case a word still expresses a coherent and 
self-contained concept. Recall the example of the polysemous 
use of ‘bank’ as a financial institution and the building where 
the services of the institution are offered; when we use the 
sentence ‘I went to the bank’ (with the financial meaning of the 
word in mind) we essentially refer to both of the polysemous 
meanings of ‘bank’ at the same time—at a higher level, the 
word ‘bank’ expresses an abstract but concise concept that 
encompasses all of the available polysemous meanings. On 
the other hand, the fact that the same name can be used 
to describe a completely different concept (such as a river 
bank or a number of objects in a row) is nothing more than 
an unfortunate coincidence expressing lack of specification. 
Indeed, a listener of the above sentence can retain a small 
amount of uncertainty regarding the true intentions of the 
sayer; although her first guess would be that ‘bank’ refers to
the dominant meaning of financial institution (including all related
polysemous meanings), a small possibility that the sayer
has actually visited a river bank still remains. Therefore, in the
absence of sufficient context, the meaning of a homonymous 
word is more reliably expressed as a probabilistic mixing of 
the unrelated individual meanings. 

In a distributional model of meaning where a homonymous 
word is represented by a single vector, the ambiguity in 
meaning has been collapsed into a convex combination of 
the relevant sense vectors; the result is a vector that can be 
seen as the average of all senses, inadequate to reflect the 
meaning of any of them in a reliable way. We need a way 
to avoid that. In natural language, ambiguities are resolved 
with the introduction of context (recall that meaning is use), 
which means that for a compositional model of meaning the 
resolving mechanism is the compositional process itself. We 
would like to retain the ambiguity of a homonymous word 
when needed (i.e. in the absence of appropriate context) and 
allow it to collapse only when the context defines the intended 
sense, during the compositional process. 

In summary, we seek an appropriate model that will allow
us: (a) to express homonymous words as probabilistic mixings
of their individual meanings; (b) to retain the ambiguity until
sufficient context is present to resolve
it at composition time; (c) to achieve all the above in the
multi-linear setting imposed by the vector space semantics of
our original model.

IV. Encoding Ambiguity 

The previous compositional model relies on a strong
monoidal functor from a compact closed category, representing
syntax, to FHilb, modelling a form of distributional semantics.
In this section, we will modify the functor to a new
codomain category. However, before we start, we establish a
few guidelines:

• our construction needs to retain a compact closed structure
in order to carry the grammatical reduction maps to
the new category;

• we wish to be able to compare the meaning of words
as in the previous model, i.e., the new category needs to
come equipped with a dagger structure that implements
this comparison;

• finally, we need a Frobenius algebra to merge and duplicate
information in concrete models.

To achieve our goal, we will explore a categorical construction,
inspired by quantum physics and originally due
to Selinger [13], in the context of the categorical model of
meaning developed in the previous sections.

A. Mixing in FHilb 

Although seemingly unrelated, quantum mechanics and
linguistics share a common link through the framework of
†-compact closed categories, an abstraction of the Hilbert
space formulation. These categories have been used in the past ED to
provide structural proofs for a class of quantum protocols,
essentially recasting the vector space semantics of quantum
mechanics in a more abstract way. Shifting the perspective
to the field of linguistics, we saw how the same formalism 
proposes a description of the semantic interactions of words 
at the sentence level. Here we make the connection between 
the two fields even more explicit, taking advantage of the fact 
that the ultimate purpose of quantum mechanics is to deal with 
ambiguity—and this is exactly what we need to achieve here 
in the context of language. 

We start by observing that, in quantum physics, the Hilbert
space model is insufficient to incorporate the epistemic state
of the observer in its formalism: what if one does not have
knowledge of a quantum system’s initial state and can only
attribute a probability distribution to a set of possible states?
The answer is to consider a statistical ensemble of pure
states: for example, one may assign a 1/2 probability that the
state vector of a system is $|\psi_1\rangle$ and a 1/2 probability that it is in
state $|\psi_2\rangle$. We say that this system is in a mixed state. In the
Hilbert space setting, such a state cannot be represented as a
vector. In fact, any normalised sum of pure states is again
a pure state (by the vector space structure). Note that the
state $(|\psi_1\rangle + |\psi_2\rangle)/\sqrt{2}$ is a quantum superposition and not the
mathematical representation of the mixed state above.
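
The difference between the two objects can be checked numerically. The following is a minimal sketch in plain Python, assuming two orthogonal real vectors in a two-dimensional space chosen purely for illustration (the helper names are ours, not part of the model). It builds the operator of the equal mixture and the operator of the superposition, and compares their purity Tr(ρ²), which equals 1 for a pure state and is strictly smaller for a proper mixture.

```python
import math

def outer(u, v):
    """Outer product |u><v| of two real vectors given as lists."""
    return [[a * b for b in v] for a in u]

def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def scale(c, m):
    return [[c * x for x in row] for row in m]

def matmul(m1, m2):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*m2)] for row in m1]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

# Hypothetical toy states: two orthogonal unit vectors.
psi1 = [1.0, 0.0]
psi2 = [0.0, 1.0]

# Mixed state: 1/2 |psi1><psi1| + 1/2 |psi2><psi2|.
mixed = add(scale(0.5, outer(psi1, psi1)), scale(0.5, outer(psi2, psi2)))

# Superposition: the single vector (psi1 + psi2)/sqrt(2), written as an operator.
sup_vec = [(a + b) / math.sqrt(2) for a, b in zip(psi1, psi2)]
superposition = outer(sup_vec, sup_vec)

# Purity Tr(rho^2): 1 for pure states, below 1 for proper mixtures.
purity_mixed = trace(matmul(mixed, mixed))
purity_superposition = trace(matmul(superposition, superposition))
```

Both operators have trace one, but only the mixture has vanishing off-diagonal entries; the superposition keeps them, which is why it remains a pure state.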


This situation is similar to the issue we face when trying 
to model ambiguity in distributional semantics: given two 
different meanings of a homonymous word and their relative 
weights (given as probabilities), simply looking at the convex
combination of the associated vectors collapses the ambiguous
meaning to a single vector, thereby fusing together the two
senses of the word. This is precisely what was discussed in
Sect. III. The mathematical response to this problem is to move
the focus away from states in a Hilbert space to a specific kind 
of operators on the same space: more specifically, to density 
operators, i.e., positive semi-definite, self-adjoint operators of 
trace one. The density operator formalism is our means to 
express a probability distribution over the potential meanings 
of a homonymous word in a distributional model. We formally 
define this as follows: 

Definition IV.1. Let a distributional model be given in the
form of a Hilbert space M, in which every word $w_t$ is
represented by a statistical ensemble $\{(p_i, |w_t^i\rangle)\}_i$, where
$|w_t^i\rangle$ is a vector corresponding to a specific unambiguous
meaning of the word that can occur with probability $p_i$. The
distributional meaning of the word is defined as:

$$ \rho(w_t) = \sum_i p_i \, |w_t^i\rangle\langle w_t^i| \qquad (23) $$

In conceptual terms, mixing is interpreted as ambiguity
of meaning: a word $w_t$ with meaning given by $\rho(w_t)$ can
have pure meaning $w_t^i$ with probability $p_i$. Note that for
the case of a non-homonymous word, the above formula
reduces to $|w_t\rangle\langle w_t|$, with $|w_t\rangle$ corresponding to the state vector
assigned to $w_t$. Now, if mixed states are density operators,
we need a notion of morphism that preserves this structure,
i.e., that maps states to states. In the Hilbert space model,
the morphisms were simply linear maps. The corresponding
notion in the mixed setting is that of completely positive maps,
that is, positive maps that respect the monoidal structure of
the underlying category.

To constitute a compositional model of meaning, our construction
also needs to respect our stated goals: specifically,
the category of operator spaces and completely positive maps
must be a †-compact closed category; furthermore, we need
to identify the morphism that plays the part of the Frobenius
algebra of the previous model. We start working towards
these goals by describing a construction that builds a similar
category, not only from FHilb, but, more abstractly, from any
†-compact closed category.

B. Doubling and complete positivity 

The category that we are going to build was originally introduced
by Selinger [13] as a generalisation of the corresponding
construction on Hilbert spaces. Conceptually, it corresponds
to shifting the focus away from vectors or morphisms of the
form $I \to A$ to operators on the same space or morphisms of
type $A \to A$. We will formalise this idea by first introducing
the category D(C) on a compact closed category C, which
can be perhaps better understood in its diagrammatic form as
a doubling of the wires. In this context, we obtain a duality
between states of D(C) and operators of C, pictured by simple 
wire manipulations. As we will see, D(C) retains the compact 
closedness of C and is therefore a viable candidate for a 
semantic category in our compositional model of meaning. 
However, at this stage, states of D(C) do not yet admit a 
clear interpretation in terms of mixing. This is why we need 
to introduce the notion of completely positive morphisms, of 
which positive operators on a Hilbert space (mixed states in 
quantum mechanics) are a special case. This will allow us later 
to define the subcategory CPM(C) of D(C). 

1) The D construction (doubling): First, given a †-compact
closed category³ C we define:

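Definition IV.2. The category D(C) with

• the same objects as C;

• morphisms between objects A and B of D(C) are morphisms
$A \otimes A^* \to B \otimes B^*$ of C;

• composition and dagger are inherited from C via the
embedding $E : D(C) \hookrightarrow C$ defined by

$$ E : \begin{cases} A \mapsto A \otimes A^* & \text{on objects;} \\ f \mapsto f & \text{on morphisms.} \end{cases} $$

In addition, we can endow the category D(C) with a monoidal
structure by defining the tensor $\otimes_D$ by

$$ A \otimes_D B = A \otimes B $$

on objects A and B, and for morphisms $f_1 : A \otimes A^* \to B \otimes B^*$
and $f_2 : C \otimes C^* \to D \otimes D^*$, by:

$$ f_1 \otimes_D f_2 : A \otimes C \otimes C^* \otimes A^* \xrightarrow{\;\cong\;} A \otimes A^* \otimes C \otimes C^* \xrightarrow{\;f_1 \otimes f_2\;} B \otimes B^* \otimes D \otimes D^* \xrightarrow{\;\cong\;} B \otimes D \otimes D^* \otimes B^* \qquad (24) $$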


Or graphically by:

[Diagram omitted.]

where the arrow $\mapsto$ represents the functor E and we use the
convention of depicting morphisms in D(C) with thick wires
and boxes to avoid confusion. Note that the intuitive alternative
of simply juxtaposing the two morphisms as we would in C
fails to produce a completely positive morphism in general, as
will become clearer when we define complete positivity in
this context.

This category carries all the required structure. We refer the
reader to [13] for a proof of the following:

Proposition IV.1. The category D(C) inherits a †-compact
closed structure from C via the strict monoidal functor
$M : C \to D(C)$ defined inductively by

$$ M : \begin{cases} f_1 \otimes f_2 \mapsto M(f_1) \otimes_D M(f_2); \\ A \mapsto A & \text{on objects;} \\ f \mapsto f \otimes f_* & \text{on morphisms,} \end{cases} $$

where $f_* = (f^\dagger)^*$ by definition.

³ The construction works on any monoidal category with a dagger, i.e., an
involution, but we will not need the additional generality.

The functor M shows that we are not losing any expressive
power since unambiguous words (represented as maps of
C) still admit a faithful representation in doubled form. For
reference, the reader can find in Fig. 1 a dictionary that
translates useful diagrams from one category to the other.
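
For intuition, the action of M on morphisms can be sketched concretely in FHilb, where maps are matrices. The sketch below is an illustration under our own assumptions: it takes $f_*$ to be the entrywise complex conjugate of the matrix of f and realises the tensor as a Kronecker product. The matrices `f`, `g` and the vector `w` are arbitrary toy values, and the helper names are hypothetical.

```python
def conj(m):
    """Entrywise complex conjugate of a matrix given as a list of rows."""
    return [[x.conjugate() for x in row] for row in m]

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def double(f):
    """M(f) = f (x) f_*, with f_* taken as the entrywise conjugate of f."""
    return kron(f, conj(f))

# Toy morphisms on a two-dimensional space.
f = [[1, 2j], [0, 1]]
g = [[0, 1], [1j, 1]]

# Functoriality: doubling a composite equals composing the doubles.
lhs = double(matmul(g, f))
rhs = matmul(double(g), double(f))

# A (non-normalised) state |w> as a 2x1 matrix; its double lists the
# entries w_i * conj(w_j) of the operator |w><w|.
w = [[1], [1j]]
doubled_w = double(w)
```

In this reading a pure word vector $|w\rangle$ is sent to the flattened operator $|w\rangle\langle w|$, which is the sense in which unambiguous words keep a faithful representation after doubling.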

Now, notice that we have a bijective correspondence between
states of D(C), i.e., morphisms $I \to A$, and operators
on A in C. Explicitly, the map $C(A, A) \to C(I, A \otimes A^*)$ is,
for an operator $\rho : A \to A$,

$$ \rho \mapsto \ulcorner \rho \urcorner = (\rho \otimes 1_{A^*}) \circ \eta_{A^*} \qquad (26) $$

that is easily seen to be an isomorphism by bending back the
rightmost wire.⁴ In the special case of states, the generalised
inner product generated by the dagger functor can be computed
in terms of the canonical trace induced by the compact closed
structure (and reduces to the usual inner product on a space
of operators in FHilb):

[Diagram omitted.]

⁴ An application of the yanking equations (3).

We now proceed to introduce complete positivity. 

2) The CPM construction (complete positivity) [13]:

Definition IV.3. A morphism $f : A \to B$ of D(C) is
completely positive if there exists an object C and a morphism
$k : C \otimes A \to B$, in C, such that f embeds in C as
$(k \otimes k_*) \circ (1_A \otimes \eta_{C^*} \otimes 1_{A^*})$ or, pictorially,

[Diagram (28) omitted.]

From this last representation, we easily see that the compo¬ 
sition of two completely positive maps is completely positive. 
Similarly, the tensor product of two completely positive maps 
is completely positive. Therefore, we can define: 

Definition IV.4. The category CPM(C) is the subcategory 
of D(C) whose objects are the same and morphisms are 
completely positive maps. 

CPM(C) is monoidal and $\otimes_{CPM} = \otimes_D$. We easily recover
the usual notion of positive operator from this definition:

[Diagram omitted.]

with pure states corresponding to the disconnected case.

Finally, from Definition IV.3 it is clear that, for a morphism
f of C, $M(f) = f \otimes f_*$ is completely positive. Thus,

Proposition IV.2. M factors through the embedding
$I : CPM(C) \hookrightarrow D(C)$, i.e., there exists a strictly monoidal
functor $\tilde{M} : C \to CPM(C)$ such that $M = I\tilde{M}$.



3) Frobenius algebras: We are only missing a Frobenius
algebra to duplicate and delete information as necessary (Sect.
II-B). It is natural to first consider the doubled version of
Frobenius algebras in C, i.e. the †-Frobenius algebra whose
copying map is $M(\Delta)$ and whose deleting map is $M(\iota)$, as
doubling preserves both operations. In addition, the monoid
operation is clearly completely positive. In more concrete
terms, the monoid operation is precisely the point-wise (sometimes
called Hadamard) product of matrices. The morphisms
of this Frobenius algebra are shown in Fig. 1.
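
The claim that the doubled monoid operation is the point-wise product can be checked by brute force on a small example. The sketch below assumes the basis-induced merging map $|a\rangle \otimes |b\rangle \mapsto \delta_{ab}|a\rangle$ and works directly with matrix indices rather than with the wire ordering of D(C); the two input matrices are toy values, and the function names are ours.

```python
def frobenius_mu(n):
    """mu[i][a][b] = 1 iff i == a == b: the merging map |a>|b> -> delta_ab |a>."""
    return [[[1 if i == a == b else 0 for b in range(n)]
             for a in range(n)] for i in range(n)]

def doubled_merge(rho, sigma):
    """Apply mu (x) mu_* to the pair (rho, sigma), summing over internal indices.

    mu is real, so mu_* has the same entries as mu.
    """
    n = len(rho)
    mu = frobenius_mu(n)
    return [[sum(mu[i][a][b] * mu[j][c][d] * rho[a][c] * sigma[b][d]
                 for a in range(n) for b in range(n)
                 for c in range(n) for d in range(n))
             for j in range(n)] for i in range(n)]

def hadamard(rho, sigma):
    """Point-wise (Hadamard) product of two matrices of the same shape."""
    return [[x * y for x, y in zip(r, s)] for r, s in zip(rho, sigma)]

rho = [[0.5, 0.5], [0.5, 0.5]]      # a pure state written as an operator
sigma = [[0.75, 0.0], [0.0, 0.25]]  # a mixture of the two basis vectors

merged = doubled_merge(rho, sigma)
```

Note that the result generally has trace below one, so a normalisation step would be needed before reading it as a density matrix; the operation is also symmetric in its two arguments, in line with the commutativity of this algebra.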


Fig. 1. Translation from C to D(C). 

















C. Categorical Model of Meaning: Reprise 

We are now ready to put together all the concepts introduced 
above in the context of a compositional model of meaning. 
Our aim in this section is to reinterpret the previous model 
of 0 as a functor from a compact closed grammar to the 
category CPM(C), for any compact closed category C. Given
semantics in the form of a strong monoidal functor $Q : C_F \to C$,
our model of meaning is defined by the composition:

$$ MQ : C_F \xrightarrow{\;Q\;} C \xrightarrow{\;M\;} CPM(C) \qquad (30) $$

Since M sends an object A to the same A in CPM(C), 
the mapping of atomic types, their duals and relational types 
of the grammar occur in exactly the same fashion as in the 
previous model. Furthermore, note that Q is strongly monoidal 
and M is strictly monoidal, so the resulting functor is strongly 
monoidal and, in particular, preserves the compact structure. 
Thus, we can perform type reductions in CPM(C) according
to the grammatical structure dictated by the category $C_F$.

Note that we have deliberately abstracted the model to 
highlight its richness—the category C could be any compact 
closed category: FHilb, the category Rel of sets and relations 
(in which case we recover a form of Montague semantics 0) 
or, as we will see in Sect. VI, even another iteration of the
CPM construction. 

Definition IV.5. Let $\rho(w_i)$ be a meaning state $I \to MQ(p_i)$
corresponding to word $w_i$ with type $p_i$ in a sentence $w_1 \ldots w_n$.
Given a type-reduction $\alpha : p_1 \cdot \ldots \cdot p_n \to s$, the meaning of
the sentence is defined as:

$$ \rho(w_1 \ldots w_n) := MQ(\alpha)\big(\rho(w_1) \otimes_{CPM} \ldots \otimes_{CPM} \rho(w_n)\big) $$

For example, assigning density matrix representations to the
words in the previous example sentence “trembling shadows
play hide and seek”, we obtain the following meaning representation:

[Diagram omitted: doubled type reduction for “Trembling shadows play hide-and-seek”.]



Diagrammatically, it is clear that in the new setting the
partial trace implements meaning composition. Note that
diagrams such as the above illustrate the flow of ambiguity or
information between words. How does ambiguity evolve when
composing words to form sentences? This question is very
hard to answer precisely in full generality. The key message
is that (unambiguous) meaning emerges in the interaction of
a word with its context, through the wires. This process of
disambiguation is perhaps better understood by studying very
simple examples. For instance, it is interesting to examine
the interaction of an ambiguous word with a pure meaning
word to build intuition, for example the particular interaction
of an ambiguous noun with an unambiguous verb. In fact,
since density operators are convex sums of pure operators, all
interactions are convex combinations of this simple form of
word composition. In addition, disambiguation is one of the
key NLP tasks on which the previous compositional models
were tested and thus constitutes an interesting case study.


D. Introducing ambiguity in formal semantics 

Here, we will work in the category CPM(Rel). We recall
that Rel is the †-compact category of sets and relations.
The tensor product is the Cartesian product and the dagger
associates to a relation its opposite. Let our sentence set
be S = {true, false}. In Rel, this means that we are
only interested in the truth of a sentence, as in Montague
semantics. In this context, nouns are subsets of attributes.
Given a context to which we pass the meaning of a word, the
meaning of the resulting sentence can be either $|false\rangle$, $|true\rangle$
or $|false\rangle + |true\rangle$, the latter representing superposition, i.e.,
the case for which the context is insufficient to determine the
truth of all the attributes of the word (classically, this can be
identified with false).

On the other hand, in the internal logic of CPM(Rel),
mixing will add a second dimension that can be interpreted as
ambiguous meaning, regardless of truth. The possible values
are now:

$$ |true\rangle\langle true|, \qquad |false\rangle\langle false|, \qquad (|true\rangle + |false\rangle)(\langle true| + \langle false|), \qquad 1_S $$

where the identity on S represents ambiguity.

[Diagram omitted: an ambiguous word composed with its context.]

Note that we use Dirac notation in Rel rather than set 
theoretic union and cartesian product, since elements in finite 
sets can be seen as basis vectors of free modules over the 
semi-ring of Booleans; a binary relation can be expressed as 
an adjacency matrix. The trace of a square matrix picks out 
the elements for which the corresponding relation is reflexive. 

Consider the phrase ‘queen rules’. We allow a few highly
simplifying assumptions: first, we restrict our set of nouns to
the rather peculiar ‘Freddy Mercury’, ‘Brian May’, ‘Elisabeth
II’, ‘chess’, ‘England’ and the empty word $\epsilon$. Moreover,
we consider the verb ‘rule’, supposed to have the following
unambiguous meaning:

$$ |rule\rangle = |band\rangle \otimes |true\rangle \otimes |\epsilon\rangle + |chess\rangle \otimes |false\rangle \otimes |\epsilon\rangle + |elisabeth\rangle \otimes |true\rangle \otimes |england\rangle $$

with the obvious $|band\rangle = |freddy\rangle + |brian\rangle$. This definition
reflects the fact that a band can rule (understand “be the best”)
as well as a monarch. Finally, the ambiguous meaning of
‘queen’ is represented by the following operator:

$$ \rho(queen) = |elisabeth\rangle\langle elisabeth| + |band\rangle\langle band| + |chess\rangle\langle chess| $$


With the self-evident grammatical structure we can compute
the meaning of the sentence diagrammatically:

[Diagram omitted: ‘queen’ composed with ‘rule’ in doubled form.]

which, in algebraic form, yields
$\mathrm{Tr}_N\big(|rule\rangle\langle rule| \circ (\mathrm{Tr}_{N'}(\rho(queen)) \otimes 1_S)\big) = 1_S$.⁵
In other words, the meaning
of the sentence is neither true nor false but still ambiguous.
This is because the context that we pass to ‘queen’ is
insufficient to disambiguate it (the band or the monarch can
rule).

Now, if we consider ‘queen rules England’, the only
matching pattern in the definition of $|rule\rangle$ is $|elisabeth\rangle$,
which corresponds to a unique and therefore unambiguous
meaning of $\rho(queen)$. Hence, a similar calculation yields
$\mathrm{Tr}_N\big(|rule\rangle\langle rule| \circ (\mathrm{Tr}_{N'}(\rho(queen)) \otimes |england\rangle\langle england|)\big) = |true\rangle\langle true|$
and the sentence is
not only true but unambiguous. In this case, the context was
sufficient to disambiguate the meaning of the word ‘queen’.
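
The two computations can be replayed with ordinary Python sets, reading a state of CPM(Rel) on a set X as a subset of X × X. This is a sketch under our own assumptions: the verb is doubled by pairing every triple with every other triple, the subject wires are matched against $\rho(queen)$, and a missing object is discarded by identifying the two copies of the object wire (a trace). The function `sentence_meaning` and the token `"eps"` for the empty word are hypothetical names introduced for the illustration.

```python
from itertools import product

EMPTY = "eps"  # stands for the empty word
band = {"freddy", "brian"}

# |rule> as a subset of N x S x N' (subject, truth value, object).
rule = {(member, True, EMPTY) for member in band} | {
    ("chess", False, EMPTY),
    ("elisabeth", True, "england"),
}

# rho(queen) = |elisabeth><elisabeth| + |band><band| + |chess><chess|,
# as a relation on N.
queen = {("elisabeth", "elisabeth"), ("chess", "chess")} | set(product(band, band))

def sentence_meaning(subject, verb, obj=None):
    """Contract a doubled verb with its arguments; return a subset of S x S.

    subject: relation on N (set of pairs).
    obj: relation on N' (set of pairs), or None to discard the object wire,
         which forces both copies of the object to agree.
    """
    result = set()
    for (n1, s1, o1), (n2, s2, o2) in product(verb, verb):
        if (n1, n2) not in subject:
            continue
        object_ok = (o1 == o2) if obj is None else ((o1, o2) in obj)
        if object_ok:
            result.add((s1, s2))
    return result

queen_rules = sentence_meaning(queen, rule)
queen_rules_england = sentence_meaning(queen, rule, {("england", "england")})
```

Under these assumptions `queen_rules` is the identity relation on S, matching the ambiguous value $1_S$, while `queen_rules_england` contains only the pair (true, true), matching $|true\rangle\langle true|$.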

E. Flow of information with †-Frobenius algebras

In the above examples we used the assumption that a verb
tensor had been faithfully constructed according to its grammatical
type. However, as we saw in Sect. II-E, concrete
constructions might yield operators on a space of tensor order
lower than the space to which the functor MQ maps their
grammatical type. As before, †-Frobenius algebras can be used
to solve this type mismatch and encode the information carried
by an operator into tensors of higher order.

Assume that we have a distributional model in the form
of a vector space W with a distinguished basis and density
matrices on W and $W \otimes W$ to represent the meaning of our
nouns and verbs, respectively. Using the doubled version of
the †-Frobenius algebra induced by the basis (as well as the
proven empirical method of copying the object) our example
sentence is given by:

[Diagram omitted: “Trembling shadows play hide-and-seek” with all wires of type W.]

In addition to being a convenient way for creating verb
tensors, the application of Frobenius algebras in the new model
has another very important practical advantage: it results in a
significant reduction in the dimensionality of the density matrices,
mitigating space complexity problems that might be created
from the imposed doubling in a practical implementation.

⁵ Note that we delete the subject dimension of the verb in order to reflect
the fact that it is used intransitively.


TABLE I
Computing Entropy for Nouns Modified by Relative Clauses
and Adjectives.

Relative Clauses
noun: verb1/verb2        noun    noun that verb1    noun that verb2
organ: enchant/ache      0.18    0.11               0.08
vessel: swell/sail       0.25    0.16               0.01
queen: fly/rule          0.28    0.14               0.16
nail: gleam/grow         0.19    0.06               0.14
bank: overflow/loan      0.21    0.19               0.18

Adjectives
noun: adj1/adj2          noun    adj1 noun          adj2 noun
organ: music/body        0.18    0.10               0.13
vessel: blood/naval      0.25    0.05               0.07
queen: fair/chess        0.28    0.05               0.16
nail: rusty/finger       0.19    0.04               0.11
bank: water/financial    0.21    0.20               0.16


F. Measuring ambiguity with real data 

While a large-scale experiment is out of the scope of this
paper, in this section we present some preliminary
results that showcase the potential of the model. Using
2000-dimensional meaning vectors created by the procedure
described in Appendix A, we show how ambiguity evolves for
five ambiguous nouns when they are modified by an adjective
or a relative clause. For example, ‘nail’ can appear as ‘rusty
nail’ or ‘nail that grows’; in both cases the modifier resolves
part of the ambiguity, so we expect that the entropy of the
larger compound would be lower than that of the original
ambiguous noun. Both types of composition use the Frobenius
framework described in Sect. II-E; specifically, composing
an adjective with a noun follows (19), while for the relative
pronoun case we use (22). We further recall that for a density
matrix $\rho$ with eigen-decomposition $\rho = \sum_i e_i |e_i\rangle\langle e_i|$, von
Neumann entropy is given as:

$$ S(\rho) = -\mathrm{Tr}(\rho \ln \rho) = -\sum_i e_i \ln e_i \qquad (31) $$
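
As a small worked instance of (31), the sketch below evaluates the entropy of two toy 2×2 density matrices. It is an illustration only: the closed-form eigenvalue helper is restricted to real symmetric 2×2 matrices (our simplifying assumption; for the 2000-dimensional matrices of the experiment a numerical eigen-solver would be needed), and the function names are ours.

```python
import math

def eigenvalues_2x2(rho):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = rho
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return [mean - radius, mean + radius]

def von_neumann_entropy(eigenvalues, tol=1e-12):
    """S = -sum_i e_i ln e_i, using the convention 0 ln 0 = 0."""
    return -sum(e * math.log(e) for e in eigenvalues if e > tol)

pure = [[0.5, 0.5], [0.5, 0.5]]             # a single unambiguous meaning
maximally_mixed = [[0.5, 0.0], [0.0, 0.5]]  # two equally likely orthogonal senses

s_pure = von_neumann_entropy(eigenvalues_2x2(pure))
s_mixed = von_neumann_entropy(eigenvalues_2x2(maximally_mixed))
```

The pure state has entropy zero and the equal mixture of two orthogonal senses has entropy ln 2, the largest value available in two dimensions; the natural logarithm is used, as in (31).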

The results are presented in Table I. Note that the entropy
of the compounds is always lower than that of the ambiguous
noun. Even more interestingly, for some cases (e.g. ‘vessel that
sails’) the context is so strong that it is capable of almost purifying
the meaning of the noun. This demonstrates an important
aspect of the proposed model: disambiguation = purification.

Finally, the fact that the composite semantic representations
indeed reflect their intended meaning has been verified by
performing a number of informal comparisons; for example,
‘queen that flies’ was close to ‘bee’, but ‘queen that rules’ 
was closer to ‘palace’; ‘water bank’ was closer to ‘fish’, but 
‘financial bank’ was closer to ‘money’, and so on. 

V. Non-commutativity 

If the last section was concerned with applications of the 
CPM-construction to model ambiguity, here we discuss the 
role of the D-construction for the same purpose. Frobenius 
algebras on objects of D(C) are not necessarily commutative
and thus their associated monoid is not a completely positive
morphism. In the quantum physical literature, non-completely
positive maps are not usually considered since they are not
physically realisable. However, in linguistics, free from these 
constraints, we could theoretically venture outside of the 
subcategory CPM(C), deep into D(C). More general states 
could appear as a result of combining mixed states according 
to the reduction rules of our compositional model. There is 
no reason for such an operator to be a mixed state itself since 
there is no constraint in our model that requires sentences to 
decompose into a mixture of atomic concepts. 

A. Non-commutativity and complete positivity 

Coecke, Heunen and Kissinger [32] introduced the category
CP*(C) of †-Frobenius algebras (with additional technical
conditions) and completely positive maps, over an arbitrary
†-compact category C, in order to study the interaction of
classical and quantum systems in a single categorical setting:
classical systems are precisely the commutative algebras and
completely positive maps are quantum channels, that is, physically
realisable processes between systems. Interestingly, in
accordance with the content of the no-broadcasting theorem
for quantum systems, the multiplication of a commutative
algebra is a completely positive morphism while the multiplication
of a non-commutative algebra is not. It is clear
that the meaning composition of words in a sentence is only
commutative in exceptional cases. The non-commutativity of
the grammatical structure reflects this. However, in earlier
methods of composition, this complexity was lost in translation
when passing to semantics.

With linguistic applications in mind, the CP* construction 
suggests various ways of composing the meaning of words, 
each corresponding to a specific Frobenius algebra operation. 
Conceptually, this idea makes sense since a verb does not 
compose with its subject in the same way that an adjective 
composes with the noun phrase to which it applies. The various 
ways of composing words may also offer a theoretical base 
for the introduction of logic in distributional models of natural 
language. 

B. A purely quantum algebra 

This is where the richness of D(C) reveals itself: algebras 
in this category are more complex and, in particular, allow 
us to study the action of non-commutative structures—a topic 
of great interest to formal linguistics where the interaction of 
words is highly non-commutative. Hereafter we introduce a 
non-commutative †-Frobenius algebra that is not the doubled
image of any algebra in C. 

Definition V.1. For every object A of D(C), the morphisms
of D(C), $\mu : A \otimes_D A \to A$, defined in C by:

$$ (1_A \otimes \epsilon_A \otimes 1_{A^*}) \circ (1_{A \otimes A} \otimes \sigma_{A^*,A^*}) $$

and $\iota : I \to A$, with the following definition in C:

$$ \iota = \eta_{A^*} $$

are the multiplication and unit of a †-Frobenius algebra $\mathcal{F}_D$,
where $\sigma$ is the natural swap isomorphism in C.

Proof that the above construction is indeed a †-Frobenius
algebra can be found in [33]. The action of the Frobenius
multiplication $\mu$ on states $I \to A$ of D(C) is particularly
interesting; in fact, it implements the composition of operators
of C, in D(C), as evidenced by the next diagram:

[Diagram omitted.]

The meaning of the “trembling shadows...” sentence using
the algebra $\mathcal{F}_D$ becomes:

[Diagram omitted: “Trembling shadows play hide-and-seek” composed with $\mathcal{F}_D$.]

How does composition with the new algebra affect the flow
of ambiguity in the simple case of an ambiguous word to
which we pass an unambiguous context? Given a projection
onto a one-dimensional subspace $|w\rangle\langle w|$ and a density
operator $\rho$, the composition $|w\rangle\langle w|\rho$ is a (not necessarily
orthogonal) projection. In a sense, the meaning of the pure
word determines that of the ambiguous word, as evidenced by
the disconnected topology of the following diagram:

[Diagram omitted: a pure context composed with an ambiguous word.]
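
The behaviour of this composition can be probed numerically. The sketch below uses a toy unit vector `w` and a toy 2×2 density matrix `rho`, both our own illustrative choices. It checks that $|w\rangle\langle w|\rho$ is a rank-one operator that is idempotent up to the scalar $\langle w|\rho|w\rangle$, and that the order of composition matters.

```python
def outer(u, v):
    """Outer product |u><v| of two real vectors given as lists."""
    return [[a * b for b in v] for a in u]

def matmul(m1, m2):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*m2)] for row in m1]

w = [0.6, 0.8]                  # unit vector: the pure context
rho = [[0.7, 0.1], [0.1, 0.3]]  # toy density matrix: the ambiguous word

projector = outer(w, w)
composite = matmul(projector, rho)       # |w><w| rho
reversed_order = matmul(rho, projector)  # rho |w><w|

# <w| rho |w>, the scalar by which the composite fails to be idempotent.
expectation = sum(w[i] * rho[i][j] * w[j] for i in range(2) for j in range(2))
squared = matmul(composite, composite)

# Rank one: the determinant of the 2x2 composite vanishes.
determinant = composite[0][0] * composite[1][1] - composite[0][1] * composite[1][0]
```

Dividing the composite by $\langle w|\rho|w\rangle$ therefore yields an exact (oblique) projection whose range is spanned by $|w\rangle$: whatever the mixture in $\rho$, the output direction is fixed by the pure word.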



VI. Adding lexical entailment 

We now demonstrate the advantage of the fact that the
CPM-construction is an abstract construction, and hence can be
applied to any suitable (i.e. living in a †-compact closed
category) model of word meaning.

Besides ambiguity, another feature of language which is 
not captured by the distributional model is the fact that 
the meaning of one word (= hypernym) generalises that of
another word (= hyponym). This points at a partial ordering of
word meanings. For example, ‘painter’ generalises ‘Brueghel’.
Density matrices can be endowed with a partial ordering which
could play that role, e.g. the Bayesian ordering [34]. This
raises the question of how to accommodate both features 
together in a model of natural language meaning. 









Since CPM(C) is always †-compact closed, a canonical
solution is obtained by iterating the CPM-construction:

[Diagram omitted.]

Given a word/phrase/sentence meaning:

[Diagram omitted.]

lack of any ambiguity or generality correspond to distinct
diagrams, respectively:

[Diagrams omitted.]

Ambiguity or generality can then be measured by taking the
von Neumann entropy of the following operators respectively:

[Diagrams omitted.]


VII. Conclusion and Future Work 


In this paper we detailed a compositional distributional
model of meaning capable of explicitly handling lexical ambiguity.
We discussed its theoretical properties and demonstrated
its potential for real-world natural language processing tasks
by a small-scale experiment. A large-scale evaluation will
be our challenging next step, aiming to provide empirical
evidence regarding the effectiveness of the model in general
and the performance of the different Frobenius algebras in
particular. On the theoretical side, the logic of ambiguity
in CPM(Rel), the non-commutative features of the D-construction
as well as further exploration of nested levels
of CPM, each deserve a separate treatment. In addition, one
important weakness of distributional models is the representation
of words that serve a purely logical role, like logical
connectives or negation. Density operators support a form of
logic whose distributional and compositional properties could
be examined, potentially providing a solution to this long-standing
problem of compositional distributional models.

Appendix A

From Theory to Practice

The purpose of this appendix is to show how the theoretical 
ideas presented in this paper can take a concrete form using 
standard natural language processing techniques. The setting 
we present below has been used for the mini-experiments in
Sect. IV-F. We approach the creation of density matrices as a
three-step process: (a) we first produce an ambiguous semantic 
space; (b) we apply a word sense induction method on it in 
order to associate each word with a set of sense vectors; and 
finally (c) we use the sense vectors in order to create a density 
matrix for each word. These steps are described in separate 
sections below. 


A. Creating a Concrete Semantic Space 

We train our basic vector space using ukWaC, a corpus
of English text with 2 billion words (100 million sentences).
The basis of the vector space consists of the 2,000 most
frequent content words (nouns, verbs, adjectives, and adverbs),
excluding a list of stop words.⁶ Furthermore, the vector space
is lemmatized and unambiguous regarding syntactic information;
in other words, each vector is uniquely identified by a
(lemma, pos-tag) pair, which means for example that ‘book’
as a noun and ‘book’ as a verb are represented by different
meaning vectors. The weights of each vector are set to the ratio
of the probability of the context word $c_i$ given the target word
t to the probability of the context word overall, as follows:

$$ v_i(t) = \frac{P(c_i \mid t)}{P(c_i)} = \frac{count(c_i, t) \cdot count(total)}{count(t) \cdot count(c_i)} $$

where $count(c_i, t)$ refers to how many times $c_i$ appears in the
context of t (that is, in a 5-word window at either side of t)
and $count(total)$ is the total number of word tokens in the
corpus.
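
The weighting scheme can be sketched on a toy corpus as follows. This is an illustration rather than the pipeline used for the experiments: the corpus, the two-word basis and the function name `cooccurrence_weights` are hypothetical, and we assume that context windows do not cross sentence boundaries.

```python
from collections import Counter

def cooccurrence_weights(sentences, basis, window=5):
    """Return {target: [v_i(t) for each context word c_i in basis]}.

    v_i(t) = count(c_i, t) * count(total) / (count(t) * count(c_i)).
    """
    tokens = [tok for sent in sentences for tok in sent]
    total = len(tokens)
    count = Counter(tokens)
    pair = Counter()
    for sent in sentences:
        for pos, target in enumerate(sent):
            # Assumption: the window is clipped at the sentence boundary.
            lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    pair[(sent[ctx_pos], target)] += 1
    return {
        t: [pair[(c, t)] * total / (count[t] * count[c]) if count[c] else 0.0
            for c in basis]
        for t in count
    }

# Hypothetical two-sentence corpus of lemmas.
toy_corpus = [["vessel", "sail", "port"], ["vessel", "blood", "clot"]]
weights = cooccurrence_weights(toy_corpus, basis=["sail", "blood"])
```

A weight above 1 means the context word co-occurs with the target more often than its overall frequency would suggest; here both basis words receive weight 3.0 for ‘vessel’.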


B. Word Sense Induction 

The notion of word sense induction, that is, the task of
detecting the different meanings under which a word appears
in a text, is intimately connected with that of the distributional
hypothesis: that the meaning of a word is always context-dependent.
If we had a way to create a vectorial representation
for the contexts in which a specific word occurs, then a clustering
algorithm could be applied in order to create groupings
of these contexts that hopefully reveal different usages of the
word (different meanings) in the training corpus.

⁶ That is, very common words with low information content, such as the
verbs ‘get’ and ‘take’ or adverbs like ‘really’ and ‘always’.

Meaning 1: 24070 contexts
port owner cargo fleet sailing ferry craft Navy merchant cruise
navigation officer metre voyage authority deck coast launch fishery
island charter Harbour pottery radio trip pay River Agency Scotland
sell duty visit fish insurance skipper Roman sink War shore sail
town Coastguard assistance Maritime registration call rescue bank
Museum captain incident customer States yacht mooring barge
comply landing Ireland sherd money Scottish tow tug maritime
wreck board visitor tanker freight purchase lifeboat

Meaning 2: 5930 contexts
clot complication haemorrhage lymph stem VEGF Vitamin glucose
penis endothelium retinopathy spasm antibody clotting AMD
coagulation marrow lesion angina blindness medication graft vitamin
vasoconstriction virus proliferation Ginkgo diabetic ventricle
thickening tablet anaemia thrombus Vein leukocyte scleroderma
stimulation degeneration homocysteine Raynaud breathe mediator
Biloba Diabetes LDL metabolism Gene infiltrate atheroma arthritis
lymphocyte lobe C’s histamine melanoma gut dysfunction vitro
triglyceride infarction lipoprotein

TABLE II
Derived Meanings for Word ‘Vessel’.

This intuitive idea was first presented by Schütze [12] in
1998, and more or less is the cornerstone of every unsupervised
word sense induction and disambiguation method based
on semantic word spaces up to today. The approach we use is
a direct variation of this standard technique. For what follows,
we assume that each word in the vocabulary has already been
assigned to an ambiguous semantic vector by following typical
distributional procedures, for example similar to the setting
described in Sect. A-A.

We assume for simplicity that the context is defined at
the sentence level. First, each context for a target word $w_t$
is represented by a context vector of the form $\frac{1}{n}\sum_{i=1}^{n} |w_i\rangle$,
where $|w_i\rangle$ is the semantic vector of some other word $w_i \neq w_t$
in the same context. Next, we apply hierarchical agglomerative
clustering on this set of vectors in order to discover the latent
senses of $w_t$. Ideally, the contexts of $w_t$ will vary according to
the specific meaning in which this word has been used. Table
II provides a visualization of the outcome of this process for
the ambiguous word ‘vessel’. Each meaning is visualized as a
list of the most dominant words in the corresponding cluster,
ranked by their TF-IDF values.

We take the centroid of each cluster as the vectorial representation
of the corresponding sense/meaning. Thus, each
word w is initially represented by a tuple $(|w\rangle, S_w)$, where
$|w\rangle$ is the ambiguous semantic vector of the word as created
by the usual distributional practice, and $S_w$ is a set of sense
vectors (that is, centroids of context vectors clusters) produced
by the above procedure.
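
The procedure above (context vectors, clustering, centroids) can be sketched end to end on toy data. Everything concrete here is an assumption made for illustration: the two-dimensional word vectors, the four contexts, the choice of cosine distance with average linkage, and the fixed number of two clusters; the source only specifies hierarchical agglomerative clustering.

```python
import math

def mean(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def agglomerate(vectors, n_clusters):
    """Naive average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [cosine_distance(vectors[i], vectors[j])
                         for i in clusters[a] for j in clusters[b]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters

# Hypothetical word vectors and contexts of the target word 'vessel'
# (the target itself is left out of each context).
word_vectors = {"port": [1.0, 0.0], "sail": [0.9, 0.1],
                "blood": [0.0, 1.0], "clot": [0.1, 0.9]}
contexts = [["port", "sail"], ["sail"], ["blood", "clot"], ["clot"]]

context_vectors = [mean([word_vectors[w] for w in ctx]) for ctx in contexts]
clusters = agglomerate(context_vectors, n_clusters=2)
sense_vectors = [mean([context_vectors[i] for i in c]) for c in clusters]
```

On this toy input the nautical and medical contexts separate into two clusters, and the two centroids play the role of the sense vectors in $S_w$. The quadratic search over cluster pairs is only meant for small examples.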

Note that our approach takes place at the vector level (as
opposed to tensors of higher order), so it provides a natural
way to create sets of meaning vectors for “atomic” words of
the language, that is, for nouns. It turns out that the generalization
of this to tensors of higher order is straightforward, since
the clustering step has already equipped us with a number of
sets consisting of context vectors, each one of which stands
in one-to-one correspondence with a set of contexts reflecting
a different semantic usage of the higher-order word. One then
can use, for example, the argument “tensoring and summing”
procedure of [29] (briefly described in Sect. II-E) in order to
compute the meaning of the ith sense of a word of arity n as:

$$ |word\rangle_i = \sum_{c \in C_i} \bigotimes_{k=1}^{n} |arg_{k,c}\rangle \qquad (32) $$

where $C_i$ is the set of contexts associated with the ith sense,
and $arg_{k,c}$ denotes the kth argument of the target word in
context c. Of course, more advanced statistical methods could
be also used for learning the sense tensors from the provided
partitioning of the contexts, as long as these methods respect
the multi-linear nature of the model. This completes the word
sense induction step.
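
Equation (32) amounts to a sum of tensor products, which can be sketched as follows. The example assumes a word of arity 2 (a transitive verb with a subject and an object vector per context), stores tensors as flattened lists, and uses made-up two-dimensional argument vectors; `sense_tensor` is a hypothetical helper name.

```python
from functools import reduce

def tensor(u, v):
    """Flattened tensor (Kronecker) product of two vectors."""
    return [a * b for a in u for b in v]

def sense_tensor(contexts):
    """Sum over contexts c of arg_{1,c} (x) ... (x) arg_{n,c}, as in (32)."""
    products = [reduce(tensor, args) for args in contexts]
    return [sum(components) for components in zip(*products)]

# Two contexts assigned to the same sense of a transitive verb:
# each entry is (subject vector, object vector).
contexts_sense_1 = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
verb_sense_1 = sense_tensor(contexts_sense_1)
```

The result has $d^n$ components for argument vectors of dimension d, here 4, and each sense of the word gets its own tensor built from its own set of contexts.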

C. Creating Density Matrices 

We have now managed to equip each word with a set 
of sense vectors (or higher-order tensors, depending on its 
grammatical type). Assigning a probability to each sense is 
trivial and can be directly derived by the number of times the 
target word occurs under a specific sense divided by the total 
occurrences of the word in the training corpus. This creates a 
statistical ensemble of state vectors and probabilities that can 
be used for computing a density matrix for the word according
to Definition IV.1.
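
This last step can be sketched as follows. The context counts are those reported for ‘vessel’ in Table II; the three-dimensional sense vectors are made up for the illustration, and normalising each sense vector to unit length (so that the result has trace one) is our assumption, since the text does not spell out the normalisation.

```python
import math

def normalise(v):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def density_matrix(sense_vectors, sense_counts):
    """rho = sum_i p_i |s_i><s_i|, with p_i the relative frequency of sense i."""
    total = sum(sense_counts)
    dim = len(sense_vectors[0])
    rho = [[0.0] * dim for _ in range(dim)]
    for vec, cnt in zip(sense_vectors, sense_counts):
        p = cnt / total
        unit = normalise(vec)  # assumption: senses are unit vectors
        for i in range(dim):
            for j in range(dim):
                rho[i][j] += p * unit[i] * unit[j]
    return rho

sense_counts = [24070, 5930]                        # contexts per sense (Table II)
sense_vectors = [[3.0, 1.0, 0.0], [0.0, 1.0, 2.0]]  # hypothetical sense centroids

probabilities = [c / sum(sense_counts) for c in sense_counts]
rho_vessel = density_matrix(sense_vectors, sense_counts)
trace_vessel = sum(rho_vessel[i][i] for i in range(len(rho_vessel)))
```

With these counts the nautical sense receives probability of roughly 0.80 and the medical sense roughly 0.20, and the resulting matrix is symmetric with unit trace.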






